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Abstract 

It is well known that in a supervised classification setting when the number of features, p, 
is smaller than the number of observations, n, Fisher's linear discriminant rule asymptotically 
achieves Bayes rule. However, there are numerous applications where classification is needed in 
the p » n setting. Naive implementation of Fisher's rule in this setting fails to provide good 
results because the sample covariance matrix is singular. Moreover, by constructing a classifier 
that relies on all p features the interpretation of the results is challenging. Our goal is to provide 
robust classification that relies only on a small subset of features and accounts for the complex 
correlation structure. We apply an L\ penalty to the discriminant vector to insure sparsity 
of the solution and use a shrinkage type estimator for the covariance matrix. The resulting 
optimization problem is solved using an iterative coordinate ascent algorithm. Furthermore, we 
highlight the difference between the L\ penalized and the L\ constrained versions of the problem 
and provide justification for feature clustering. The simulation results show that the proposed 
method performs favorably in comparison to alternatives. The method is used to classify 344 
leukemia patients based on 19000 DNA methylation features. 

Keywords: Clustering; Coordinate ascent; Discriminant analysis; Duality gap; Methylation; Penal- 
ization. 



1 Introduction 



Linear discriminant analysis (LDA) is a standard method for classification when the number of ob- 
servations n is much bigger than the number of features p. If data follows p-variate normal distribu- 
tion with the same covariance structure across the groups, LDA provides an asymptotically optimal 



classification rule, meaning that it converges to Bayes rule (Anderson, 1984, Chapter 6). However 



it was noted by Dudoit et al. (2002) that naive implementation of LDA for high- dimensional data 



provides poor classification results in comparison to alternative methods. A rigorous proof of this 



phenomenon in the case p » n is given by Bickel and Levina (2004). There are two main rea- 



sons for this. First, standard LDA uses the sample covariance matrix to estimate the covariance 
structure. In high dimensional settings this results in a singular estimator. Secondly, by using all 
p features in classification, interpretation of the results becomes challenging. Often, only a small 
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subset of features is relevant and it is of great interest to perform classification and variable selection 
at the same time. 

Several methods have been proposed in the literature to account for these problems. One of the 
popular approaches is to use the independence rule which overcomes the singularity problem of the 
sample covariance but ignores the dependency structure. This approach is very appealing because 
of its simplicity and was encouraged by the work of Bickel and Levina ( 2004 ) who showed that it 
performs better than the standard LDA in the p » n setting. Examples of diagonal classification 



metho ds include |Tibshirani et al.| (|2003[ ); |Fan and Fah| Q2008| ); |Pang et al.| ( |2009[ ); |Huang et al 
( 2010P ; |WItten and Tibshirani| d201lf ~ 

Unfortunately, independence is only an approximation and it is unrealistic in most applications. 
Gene interactions, for example, are crucial for the understanding of biological processes and it is 
important to use this information in classification. Therefore we should aim for better estimators 
of the covariance matrix instead of using the independence structure. We are not the first to 
notice that there are drawbacks to the independence approach. Huang et al. (2010) note that 
the discriminant scores resulting from the diagonal rule are biased. However, instead of changing 
the rule, they propose adjusting for the bias directly. Fan et al. (2010) argue that it is crucial 
to take into account the covariance structure. However, they avoid estimating it directly in the 
p >> n setting by doing preselection of features. Shao et al. (2011 ) propose thresholding of sample 
covariance matrix, Cai (2011) estimate the covariance matrix and mean differences directly, and 
Mai et al. (2012) reformulate the LDA problem using penalized least squares. These approaches, 



however, are limited to the case of the two groups and the standard LDA formulation instead of 
Fisher's version is considered. 

The purpose of this work is to extend current methodology in a way that will enable automatic 
variable selection and account for the complex dependency structure. A motivating example is 
the analysis of DNA methylation data from patients with Acute Myeloid Leukemia (AML). The 
ERASMUS data was collected at Erasmus University Medical Center (Rotterdam) and consists 
of DNA methylation profiles of 344 patients. The ECOG data is obtained from a clinical trial 
performed by Eastern Cooperative Oncology Group and consists of DNA methylation profiles of 
383 patients. Samples from both cohorts were processed according to Thompson et al. (2008). 
Cluster analysis performed on the ERASMUS data corresponded well to the available biomarker 
information, providing new insights into the leukemia subtypes based on the methylation patterns 
(see Figueroa et al. (2010) and Kormaksson et al. (2012) for details). Since only limited biomarker 
information is available for the ECOG data, it is of great interest to use the clustering information 
from the ERASMUS data to determine the leukemia subtypes in the ECOG data. 



In this work we consider Fisher's formulation of LDA (Mardia et al. 1979), since it is derived 
without explicit normality assumption on the data. The sparsity of the solution is enforced by 
adding an L\ constraint to the objective function and therefore, the optimization problem consid- 
ered in this paper is similar to the one considered by Witten and Tibshirani (2011). However, we 
gained additional insights in the proposed method. First, we show that using a diagonal estimate 
of the covariance structure in Fisher's LDA (FLDA) with two groups is equivalent to selecting 
features by using t-statistic with an appropriate threshold and therefore we extend the algorithm 
to the non-diagonal case. Secondly, we investigate in more detail the optimization aspects of the L\ 
penalized FLDA problem and show that it is not equivalent to the L\ constrained problem. This 
has important consequences in terms of solution sparsity and tuning parameter selection. 



We propose to evaluate covariance structure through the estimator proposed by Schafer and 
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Strimmer (2005), however our algorithm can be used with any other estimator of a general form. 



The required algorithmic modification provides additional computational challenges and we use the 
coordinate ascent algorithm to solve the optimization criteria. Simulation results show significant 
improvement in misclassification rates over the method proposed by Witten and Tibshirani (2011 ), 
especially when the true covariance structure is far from the diagonal. 

In our simulation study, we noticed that in certain scenarios it is impossible to achieve a very 
sparse solution regardless of the choice of the tuning parameter. We show that this phenomenon 
is due to the duality gap between the Li-penalized and the Li-constrained formulations of the 
problem. With the success of LASSO (Tibshirani, 1996), it is very common to use the L\ penalty 



as a method to achieve sparsity in the solution (for example Zou et al. (2006) in PCA, Bradley and 



Mangasarian (1998) in SVM and Bien and Tibshirani (2011) in covariance matrix estimation). The 



L\ penalty in LASSO is motivated by the dual problem where using the L\ constraint geometrically 
means projecting the solution vector onto the subspace that forces certain components to be exactly 
zero. Since in LASSO the Li-constrained and Li-penalized problems are equivalent, it is natural to 
expect that adding L\ penalty to other objective functions provides the same effect. Unfortunately 
this is not always the case and we show that this is indeed not true in the FLDA context. We 
provide an intuitive explanation for this phenomenon and propose using feature clustering as a 
partial solution. 

The rest of the paper is organized as follows. Section [2] provides the formulation of the prob- 
lem and optimization details. Section [3] describes implementation aspects and justifies clustering. 
In Section [4] we compare the proposed algorithm with competing methods in simulation studies. 
Application to DNA methylation data is presented in Section [5] We conclude in Section [6] with a 
discussion. 



2 Methodology 

We start by considering the following optimization problem, which corresponds to unconstrained 



FLDA ( |Mardia et al.[ |1979[ ) when p < n, 

m.a,~x.v t Bv subject to v l Wv = 1. 

V 

W is the p x p within-group sum of squares matrix and B is the pxp between-group sum of squares 
matrix; i.e. W = Ylf=i n iSi, B = T — W, where T is the total sum of squares matrix, rij is the 
number of observations in the ith group and Si is the sample covariance matrix for the zth group. 
The vector v that solves this problem is the eigenvector corresponding to the maximum eigenvalue 
of the rank min(p, g — 1) matrix W~ 1 B. Moreover, since W is a positive definite matrix, the original 



problem can be rewritten as (Boyd and Vandenberghe 2004) 



max.v t Bv subject to v t Wv < 1. 

V 

Note that —^—W is a sample estimate of the population within-class covariance matrix U w and 

n g 

^pjB is a sample estimate of the population between-class covariance matrix Uj,. Therefore the aim 
of FLDA is to find a linear combination of features that maximizes the between-group variability 
with respect to the within-group variability. 

One of the problems with the discriminant analysis in the p > > n setting is that only a relatively 
small subset of features is relevant. Elimination of features in the FLDA is equivalent to estimating 
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some of the components of vector v exactly as zero. As in LASSO (Tibshirani, 1996), this can 



be achieved by introducing an additional L\ penalty. In the context of FLDA, this approach was 



addressed by Witten and Tibshirani (2011) who consider the following optimization problem: 



vq\ = arg max < 



v*Bv 




subject to v l d\a,g{W)v < 1. 



(1) 



Here A > is a tuning parameter where A = corresponds to unpenalized version of the estimation 
procedure. The larger the value of A, the larger the penalty on v and therefore the smaller the 
number of non-zero components. Note that each feature of the vector v is additionally penalized 
by its sample within-group standard deviation Sj. The solution of Witten and Tibshirani (2011) 
is restricted to the case in which estimator of E w has a diagonal form. Using a nondiagonal 
estimator of U w makes the optimization problem more challenging. This drawback is outweighed 
by the superior performance, which is demonstrated via simulations and a real data application in 
Sections [4] and [5] respectively. 



2.1 Estimation of S w and importance of correlation 



Since the discovery of the well known Stein phenomenon (Stein, 1975), the area of improved es- 



timation for the covariance matrices has been an active research area. A common approach to 
improve the estimation of the covariance matrix is that of shrinkage estimation, since it is well 
known that in high dimensions the eigenvalues of the sample covariance matrix are poor estimates 
for the true eigenvalues. The sample eigenvalues spread over the positive real numbers so that the 
smallest eigenvalues tend to zero, while the largest tend towards infinity. This causes the sample 
covariance matrix to become ill-conditioned or even singular. A number of authors have studied 



the properties of invertible estimates for the covariance matrix, see Bai and Yin (1973), Bickel and 



Levina (2004), Kubokawa and Srivastava (2008), and Ledoit and Wolf (2003) to name a few. In 



particular, Kubokawa and Srivastava (2008) develop several estimators that theoretically improve 



on the Moore-Penrose inverse estimator of the precision matrix, the covariance quantity needed in 
discriminant analysis. 



Some of the main results in Bickel and Levina (2004) rely critically on low rank structure of 



Moore-Penrose inverse of sample covariance matrix, their argument can not be applied to a positive 
definite estimators of U w . Bickel and Levina (2004 Theorem 1) showed that if p/n —> oo, then 



the misclassification rate of LDA that uses the sample covariance matrix goes to 1/2. Their proof 
is based on using Moore-Penrose inverse of sample covariance matrix which has low rank. Since 
the diagonal estimator of U w is always positive definite, they show that independence approach is 
preferable in high-dimensional settings. Indeed, in the case E w is known, the misclassification rate 
of independence rule is always greater than or equal to the misclassification rate of LDA that uses 
E w . The following example illustrates that correlations are beneficial for classification if p is large 
and the number of relevant features (= r) is small, in contrast to Bickel and Levina (2004). 

Consider classification between two groups with the mean of the first group pi = and the 
mean of the second group p2 = S. Also assume all the variables Xj are scaled to have standard 
deviation one within each population. If the variables are jointly normal and equally correlated, 
i.e. coi(xi,Xj) = p for all i ^ j, then the correlation is beneficial if p is negative or if 



P > 



(£^) 2 -Ef=i<5| 



(p-1) 



(2) 
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(Cochran 1964, eqn. 9). Assume now that Sj 
r <p. Then (|2j) can be rewritten as 



for all j > r and 5j = C for all j < r, where 



(Ei=iC) 2 -E 



■3=1 

1 C 2 



c 2 



(3) 



In modern high-dimensional applications it is common to assume that only small subset of features 



is relevant for the classification, i.e. r << p. As a consequence 



c e , where c e is a very small 



constant. It follows from ^ that even small values of p lead to improvements in squared distance. 

Therefore, instead of applying discriminant analysis with W = Ef=i n i^i or assuming indepen- 
dence, we use the following estimator: 



W 



a 



i=i 



where Si = TiTi + (1 — Ti)Si (Schafer and Strimmer, 2005), T, is a diagonal, unequal variance 



estimate and Si is the sample covariance matrix for the ith group. There are several advantages to 
using this estimator. First, the resulting matrix W is always positive definite and it preserves the 



correlation structure of the data. Secondly, an optimal 7$ can be chosen according to Ledoit and 



Wolf (2004) that guarantees minimal MSE under the existence of the first two moments. Therefore 



there are no strong distributional assumptions on the data. Finally, given the simple form of the 
estimator, W can be computed easily and quickly for the large values of p. This approach is similar 
to the one used in regularized discriminant analysis (Friedman, 1989; Guo et al. 2007). Aside 



from the feature selection procedure, our approach is different in that each within-group covariance 
matrix is shrunk towards a diagonal estimate instead of the identity matrix, and the shrinkage 
parameter r» is selected automatically. 

Alternative approaches can be taken to estimate the covariance structure U w . Moreover, it is 
desirable for the estimate to be both positive definite and sparse. Here by sparse we mean that cer- 
tain entries of the covariance matrix are estimated exactly as zero. There is an extensive literature 
on covariance matrix estimators that achieve one of these goals. However, to our knowledge only 
limited methodology is available for achieving both. Among recent advances are methods proposed 



by Bien and Tibshirani (2011) and Rothman (2012). Unfortunately, the estimators considered by 



these authors are very computationally intensive and do not scale well to high-dimensional data 
sets. 



2.2 Algorithm description of Li-penalized version 

After we have defined the appropriate covariance estimator, W, the penalized Fisher's linear dis- 
criminant problem can be expressed in the following form: 

f P \ 

max < v t Bv — \sjVj\ > subject to v t Wv < 1. (4) 

I i =1 J 

Remark: Though we use W to find the discriminant vector v, the proposed algorithm can be 
applied to any positive definite estimator of E w . 
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Before describing the computational aspects, we note that there are problems with solving the 
Li-penalized version of the algorithm instead of the Li-constrained version. In our simulations, 
we noticed that there seemed to be no value of A that resulted in a small number of non-zero 
components. That is, the solution path is not smooth in A. Specifically, there exists a Ao such 
that for all A < Ao the solution vo has at least m non-zero components and, for all A > Ao, 
vq is exactly zero. This phenomenon is illustrated in Figure [4] of Section |4j where more details 
about the simulations are given. This behavior is problematic if our goal is to perform a very 



sparse subset selection. A theoretical explanation for the behavior is provided in Sections 2.4 and 
2.3 Interestingly, although this phenomenon creates challenges in terms of variable selection, the 



simulation studies suggest that the Li-penalized version of the algorithm still has good prediction 
properties. 

The optimization problem ( 4| is non-convex and t herefore we can not apply standard convex 
optimization theory. Following Witten and Tibshirani (2011), we note that Q can be recast as a 
biconvex optimization problem 



max < 2u t B 1 / 2 v - A ^ \sjVj\ - u f u > subject to v l Wv < 1, 



(5) 



since maximizing with respect to u gives u = B l / 2 v. 

The problem ([5]) is convex with respect to u when v is fixed and is convex with respect to v 
when u is fixed. This property allows the use of Alternate Convex Search (ACS) to find the solution 



(Gorski et al. 2007, Section 4.2.1). Although in our context ACS ensures that all accumulation 



points are partial optima and have the same function value (Gorski et al. 2007, Theorem 4.9), the 
convergence to the global optima is not guaranteed due to non-convexity of (|5j). 

Starting with an initial value the algorithm proceeds by iterating the following two steps: 



Step 1 
Step 2 



U M = argmax^u^/V™) - = B l / 2 v^ 



,(m+l) 



ar 



gmax <{ 2{u^) t B 1 ' 2 v - A^ \sjVj\ \ subject to v*Wv < 1. 

3=1 



The main challenge is to solve Step 2. Following Witten and Tibshirani (2011, Proposition 2) 
it is useful to reformulate the problem as 



Jm+l) 



arg max < 



2{u (m) ) t B 1 / 2 q -X^2\s jqj \ - q l Wq 

3=1 



(6) 



where, if q^ m+1 ^ = 0, then t>( m+1 ) = 0, else i;( m+1 ) = g( + L= =. Since problem ((61) is convex 

' ^ ' ' ■ s /( g (™ + l))t W - g (m+l) ^ ^ 

with respect to q, the solution satisfies (Boyd and Vandenberghe 2004) 



2B 1 / 2 u {m) - 2Wq {m+1) - \r = 0, 



(7) 
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where r is a p- vector and each is a subgradient of IsjQ'jl, i.e. 7} = Sj if gj m+1 ' ) > 0, Tj = — Sj if 
" l " ' l ^ < and Fj is between — Sj and Sj if <^ m+1 ^ = 0. 



If W is a diagonal matrix, then wji = for i ^ j and therefore the solution to ([7]) ( | Witt en and 



Tibshirani, 2011) is 



Q 



(m+1) 
j 



5 (5 l /y»)) j . ^ 



Wj 



where S is a soft-thresholding operator, i.e. S(x,a) = sign(x)(|x| — a)+. Note that each component 
of vector g( m+1 ) does not depend on the others and therefore implementation of this step is straight- 
forward. If, however, W has a general form, solving ([7]) with respect to individual components of 
the vector g( m+1 ) gives 



S (B^M) r ^ jW 



(m+1) As 
' ~2 



(? (m+l) = ^ ^ " ^. ((S) 



J.) 



Because of the additional term, w ji9i m+1 \ each component of vector depends on the 

other components and therefore the solution is not available in closed form. It is proposed to use 
coordinate update to overcome this challenge. 

Let f(q) = 2u t B l / 2 q — A^^ =1 \sjqj \ — q t Wq be the objective function and note that / can be 
written as 

p 

f(qx,-,q P ) = fo{qu -q P ) + ^2fj(qj), 

where fo{qi, ••, q P ) = 2u t B 1 / 2 q — q t Wq is concave and differentiable in q and fj(qj) = —X\sjqj\, 
j = 1, ...,p are concave in q. It was established by Tseng (1988) that coordinate ascent methods 
converge for such functions. 

The resulting algorithm to solve ([!]) is the following: 



1. Find W = W as described in Section |2.1 

2. Choose A as described in Section [3TTI 



3. Initialize v as the eigenvector corresponding to the largest eigenvalue oiW l B and normalize 
it so that v l Wv = 1. Set q old = v. These are the starting values for the optimization 
algorithm. 

4. Iterate until convergence 

(a) u^B 1 / 2 v 

(b) Iterate until convergence 

for I = 1 : p 

j = sample (1 : p), sample without replacement 

S ({B l / 2 u)j - Ysi+j wjiqf d , ^) (9) 



new 
n old new 



IV 



J.I 
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gOld 



Note that instead of doing a linear update of all p features, a random update is performed in Q. 



This guarantees a faster convergence rate of the coordinate ascent method (Shalev-Shwartz and 



Tewari, 2009). After each iteration convergence is determined by looking at the sum of absolute 
deviations from the previous solution, D = J2 P j= i \lj eW ~ #"1- Conver 

gence is assumed to be 

achieved when D is smaller than the predefined value e > 0. 

2.3 Discriminating between the two groups using diagonal estimator of E w 

In this section we show that when there are two groups, selecting features using penalized FLDA 
with diag(VF) is equivalent to selecting features with largest absolute value of t-statistic. 

Since g = 2, the rank of B is equal to one and therefore B = jll T , where 7 is the only positive 
eigenvalue of B, and I is the corresponding eigenvector, so l T l = 1. Without loss of generality the 
features of the eigenvector I are ordered, > | Z 2 1 > ••• > \lp\ > 0, and all of the components of 
vector I are non-zero. Indeed, if \lj\ =0 for some j then from ([I]) it automatically follows that 
voj = 0. 

Write W = diag(s|). Consider new variable z such that Zj = VjSj or equivalently z = W l l 2 v. 
Then problem can be rewritten as v$\ = W~ 1 / 2 zq\, where 

p 

z ox = argmax z T W~ 1/2 BW~ 1/2 z - A^|^-| subject to z T z < 1. (10) 

j=i 

Note that vectors vq\ and zq\ have the same support. Moreover, setting B' = W~ 1 ^ 2 BW~ 1 / 2 it 
follows that rank(i? / )=rank(i?)=l, and therefore without loss of generality we can restrict attention 
to the case when W = I. Therefore the optimization problem has the following form: 

vqa = argmaxf T Sf — A||t> ||i subject to v T v < 1. (11) 



Let f(v) = v T Bv — A||f ||i be the objective function. 
Definition 1. For all j , 1 < j < p, define the sets 

Cj = {v £ W : v has exactly j non-zero components} 
and 

Aj = {v £ W : first j components of vector v are non-zero, all other components are zero}. 

Remark: Aj C Cj for 1 < j < p — 1 and Aj = Cj for j = p. 

Also define = {0}, the set that consists of the p-dimensional zero- vector. Consider 

Fj = max f(v) subject to v T v < 1. 

veAjuAo 

Note that Fq = corresponds to f(v) evaluated at v = and F = max„T 1 , <1 f(v) corresponds to 



the original problem (11). 



S 



Proposition 1. Under the constraint v T v < 1 



max f(v) 

vEOjUAo 



max f{v) 



This proposition shows that in case W = I selecting features with an L\ penalty is equivalent 
to selecting features with largest values of |Zj|, i.e. with largest absolute differences between the 
sample means of the two groups. In case W = diag(s|) notice that V = W~ l / 2 l is the eigenvector of 
B' = W~ l l 2 BW' 1 ! 2 . From ( 10 ) it follows that selecting features with an L\ penalty is equivalent to 
selecting features with largest values of |/^| = ™-, i.e. with largest absolute values of the t-statistic. 

To summarize, features selected by solving problem are the same as features selected by the 
t-test with an appropriate threshold. 

2.4 L\ constraint versus L\ penalization 

So far we have considered Li-penalized Fisher's LDA, i.e. the additional L\ norm was incorporated 
into the objective function: 



f IT 

arg max 1 v Bv 



|i} subject to v t Wv < 1, 



(12) 



where we omit the additional penalty weights for simplicity of illustration. 

A closely related problem can be formulated by incorporating an L\ norm directly in the con- 
straint set: 

vot = &Tgmax.v T Bv subject to v t Wv < 1, \\v |h < t, (13) 

veRp 

where t > is a tuning parameter. Note that we only need to consider t > 1 since smaller values 



of t result in i>ot such that VQ t WvQt < 1. Solving problem (13) is computationally more challenging 



than solving problem (|12|) since it additionally involves a binary search before performing update 



Unlike LASSO, there exists a duality gap between problems (12) and (13). For every A > we 



can find t > 1 such that vq\ = vot- The contrary, however, is not generally true. 



First, we will show that every solution to (12) is the solution to (13). Indeed, fix any A > and 



let vo\ be the corresponding solution of (12). Then it follows that for any v such that v Wv < 1 

v^Bvox - X\\v x\\i > v*Bv - A||u||i. (14) 

Now set / 

| v| | ! ) = v l Bv + X(t - |H|i) > v l Bv. 



I^oaIIi m (13). Then for each v such that v l Wv < 1 and ||f ||i < t we have from (14) 
VqxBvqx > v t Bv + A(||u a||i 



This means vqi = vqx and vqa is the solution to (13). 



We now explain why not every solution to (13) is the solution to (12) and discuss the impli 



cations for the algorithm in terms of performance and sparsity level of the solution. In LASSO 
case, the optimization problem is convex and therefore there is a guarantee that there is no duality 



gap between Li-penalized and Li-constrained versions of the algorithm (Bertsekas, 1999, Propo- 



sition 5.2.1). In Fisher's LDA the problem is non-convex and therefore this guarantee no longer 
applies. 

To illustrate when and why the duality gap appears, consider the case p = 2, g = 2 and E w = I. 
In this case B is a 2 x 2 matrix with its rank equal to one. Therefore it is completely determined 
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by its positive eigenvalue and the corresponding eigenvector. Set the eigenvalue equal to one and 
consider two scenarios. In the first scenario, initialize the eigenvector I of matrix B as (0.2,0.8) 
and in the second scenario as (0.5,0.6). In both cases the vectors are normalized to have L2 norm 
exactly equal to one. Note that the features of the eigenvectors are not equal. However, the absolute 
difference between them is smaller in the second scenario (0.1 versus 0.6). 



Following Bertsekas (1999, Chapter 5), we use a geometry-based approach to visualize the 
relationship between solutions to Li-constrained and Li-penalized problems. Consider set S of 



constrained pairs (f(v) 



-v Bv, h(v) 



as v ranges over MP with v l v < 1. Problem 



( 13 ) can be formulated as a minimal common point problem: finding a point with a minimal /th 



coordinate among all points common to both S and halfspace h <t. Furthermore, if it is possible 
to construct a hyperplane supporting S from below at this point, then there exists A > such that 



solutions to (TL2|) and (13) are the same. For both scenarios, the corresponding sets S are plotted 



in Figure [Tj In the first case, for each t > 1 a supporting hyperplane can be constructed at each 
minimal point of the set S that has coordinate h < t. In the second case, no supporting hyperplane 
can be constructed at the minimal point of the set S with h < 1 because it has to lie below the point 
(0,0) and the minimal point corresponding to h = 1.4. Moreover, such a supporting hyperplane 
does not exist for the values of t ranging from 1 to 1.4 and therefore there exists no A such that 
^oa = VQt for those t. The difference between the two scenarios is only in the matrix B, or more 
precisely, in how close the features of the eigenvector of the matrix B are to each other. 



I=(0.2, 0.8) 



l=(0.5, 0.6) 




~\ 1 1 r 

0,0 0.5 1.0 1,5 




~\ 1 1 r 

0,0 0.5 1.0 1.5 



MMIi 



MMI, 



Figure 1: Visualization of the set S, I is the eigenvector of matrix B. 



There is another interesting implication that can be observed from the shape of the sets S. If 
t takes a value that is smaller than a certain threshold (like 1.4 in the second scenario), the only 
supporting hyperplane subject to the constraints intersects the set S at the point corresponding 
to / = and h = 0. Therefore, for such t the corresponding solution to the dual problem (12) is 
exactly v = 0. In other words, the solution path of (12) is not smooth in A and a sudden drop 
in the number of non-zero features can be observed. This is consistent with the results seen in 
simulations. 



The duality gap between the problems (12) and (13) is closely related to undirect restriction on 



the range of values of t, which leads to the restriction on the sparsity level we can achieve by varying 
the tuning parameter A. We have illustrated that the eigenstructure of B plays an important role 
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in this phenomenon. Further we derive sufficient conditions for observing the drop in the number 
of features for the simple case of two groups and diagonal estimator of U w (Proposition |4| . The 
two scenarios described above are consistent with this result: the conditions of Proposition [4] are 
satisfied for the second scenario but not for the first. 

Consider the simple case of discriminating between the two groups and diagonal estimator of S w 



(without loss of generality we consider E w = I). Following the notation of Section 2.3, we derive an 
upper and a lower bound for the maximum Fj of the objective function f(v) = v t Bv — \\\v \\ \ under 
the constraints v v < 1 and v G Aj U Aq (v is either identically or has j non-zero components). 



Proposition 2. Fj < Fj < Fj where 



^ = 7W-Ap" J 



\l J \ 



\V 

and 

Fj = max (o,7\\l j \\l -A 

Here l\ = < % < j} and I is the eigenvector of B. 

It follows from Proposition[l]that F = max(Fo, F\, ...,F p ) = max(0, Fx, F p ) and vq E Uf=o 
Since Ai are disjoint by construction, this means that there exists I such that vq G Ai and vq ^ Aj 
for all j I. Therefore to quantify the drop in number of features, we need to show that there 
exists m such that vq G Aq U Uf= m ^ regardless of the choice of the tuning parameter A. In other 



words, the solution to (11) has at least m non-zero features or is exactly 0. 

We first show that, depending on the value of A, there exists m\ > 1 such that vo G Ao U 

U?=m A A* ■ 

Proposition 3. Define m\ = min{j : 1 < j < p} such that \\P\\2 > ^\f[\- Then vq\ G Aq U 
\Si=mx ^i- If there is no j for which \ \P\\2 > holds, then vq\ = 0. 

Remark: For A < 7|Zi|, the value of m\ increases with A. Moreover, if A > 7|£i| then v§\ = 0. 

Proposition 4. Define m! as max{j : 1 < j < p — 1} for which there exists r > j such that 

\\P\\2 < an d m ' = otherwise. Then vq\ G Aq U Uf= m '+i regardless of the choice of the 

tuning parameter A. 

Combining Propositions [3] and [4] we have that for any A < 7|Zi|, the solution vq\ G Aq U 
Ui , =max(m'+i,m A ) and if A > j\li\ then v \ = 0. Therefore, for the two groups case with U w = I, 



there is a theoretical explanation for the observed behavior that the solution vq\ to problem (11) 
is either vq\ = or has at least m non-zero components, where m > max(m' + l,rri\). 

We performed several simulations to investigate how the value of max(m' + l,m\) varies with 
p and how close the bound we obtained is to the value of m observed empirically. For this purpose 
we generated / as a random vector with each component ^ coming from the uniform distribution on 
[0, 1]. Afterwards, we standardized I to have L2 norm equal to one and ordered the features so that 
\h\ > \h\ > ••• > \lp\ > 0. The results are shown in Figure [2j Dots correspond to the number of 
features selected by the algorithm and the dashed line corresponds to the bound max(m' + 1, m\). 

The simulations indicate that the value of m increases with the value of p. Since in both cases 
the features l{ were generated uniformly at random, the observed relationship between m and p 
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p= 100 , m'+1 = 46 



p= 800 , m'+1 = 302 




Figure 2: Features selected by the algorithm vs theoretical threshold. 



can not be explained by the structural differences in /. However, I is restricted to have L2 norm 
equal to one, which means that the increase in p leads on average to a decrease in the differences 
between the individul features of /. Therefore we argue that the larger are the differences between 
the features of I, the higher level of sparsity can be achieved by solving Li-penalized problem (12). 
Further in Section 3.3 we show that the theoretical bound m' from Proposition [4] is consistent with 
this intuition. 



3 Implementation 

3.1 Selection of the tuning parameter A 

It is traditional to choose the tuning parameter A by cross-validation. However it is not clear in 
this context what value should be considered as the upper bound that sets all the components of 
the vector v to 0. Here we provide some intuition on how this value is chosen. 

From the form of the update on q given in Q, note that all components are set to zero if and 
only if 

(15) 



max 

3 



1/2..M' 



(2B L '* U 



j/ S 3 



< A. 



Since u = B 1 / 2 v, (15) can be rewritten as 



max 

3 



< A, 



where the initial value of v. This leads to an upper bound on A: 



A, 



2 max 

3 



Note that this bound depends on the v^°\ which is natural since v^ ' corresponds to the solution 
of Fisher's discriminant problem without the L% constraint. Cross-validation is then performed on 
a grid over the set [0, X max \- 
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3.2 Producing several discriminant vectors 

Previously we have only considered the problem of finding the first discriminant vector. If there are 
more than two groups it is often the case that this is not enough for classification, and additional 



discriminant vectors should be evaluated. Witten and Tibshirani (2011) address this problem by 



adjusting the matrix B. Let X be an n x p standardized data matrix and let Y be an n x g group 
indicator matrix where Yi g = 1 if observation i is in group g and otherwise. They consider 

b 1 = x T y(y T y)- 1 /2p^(^y)-i/2yT X; 

where the subscript / corresponds to the Zth discriminant vector, = I and P, is an orthogonal 
projection matrix into the space orthogonal to (Y T Y)~ 1 / 2 Y T Xvi for all i < I. Indeed, Y T Y is a gxg 
diagonal matrix where each diagonal element is the number of observations in the corresponding 
group and Y T X sums the features of X by group. 

Though this approach guarantees orthogonality of discriminant vectors, it also provides addi- 
tional challenge in interpretation since the same features may be involved in more than one vector. 
A simpler approach is to find discriminant vectors by reducing the feature space sequentially. After 
the first discriminant vector is produced, the features corresponding to its non-zero components 
are eliminated and then the algorithm is repeated on the remaining features. 

3.3 Clustering 

Let us take a closer look at the conditions of Proposition [4j The larger the value of m' , the 
less sparse are the solutions obtained by varying A. To have a large value for m', we need to be 

mr||3/2 

able to find a large value of j such that there exists r > j with \\P\\2 < jhT|]Wi ' w ^ ere ^ * s the 
eigenvector of B. This inequality can be rewritten as ||P||2Ki|||i r ||i < H^lhlKlli- Note that, given 
the ordering of the features of the vector I and the fact that r > j, we always have \\V\\2 < W r \\2 
and \l\ |||/ r ||i > H^lli- Hence the difference between |^i|||Z r || and ||Z r ||| should be relatively small in 
comparison to the difference between \ \V\\2 an d ||^ r ||2- It follows that the closer the features of the 
vector I are to each other, the more likely this inequality holds leading to a larger value of m! . 

The fact that the eigenvector I corresponds to the differences between the group means leads to 
a more intuitive interpretation of Proposition |4j If a group of features provides similar discriminant 
power in terms of having similar values for the differences between the two groups, the penalized 



criterion (11) either selects the whole group or eliminates the whole group. 

This observation suggests the following strategy. Use distances between the groups to cluster 
the features and create new meta-features as the averages over features belonging to each cluster. 
Then apply the algorithm to the meta-features. There are several advantages to this procedure. 
First, the dimensionality of the original problem is reduced from p to the number of clusters k. 
Secondly, the procedure is justifiable from a biological perspective, since it reflects a reluctance to 
choose only one features among the group of features that have very similar behavior. Finally, the 
differences between the meta-features are larger than the differences between the original features, 
which helps to overcome the problem of finding a sparse solution discussed above. 

4 Simulation results 

The following classification methods are considered for comparison: penalized FLDA proposed by 



Witten and Tibshirani (2011 ); its modification described in this work; and support vector machines 
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with an L\ penalty. In what follows, we refer to these methods as FLDAdiag, FLDA and SVM 
respectively. For the simplicity of illustration, the comparison is performed for the case of the two 
groups assuming the data comes from the multivariate normal distribution. We use the R package 



"penalizedLDA" for FLDAdiag (Witten, 2011) and " penalizedSVM" for SVM (Becker et al. , 2010). 



In all cases, the tuning parameter is selected by cross-validation. 
The following covariance structures are considered: 

1. Diagonal unit variance, U w = I 

2. Blockdiagonal with a network structure. We randomly choose positions for 40 blocks each of 
size 4. Within each block the offdiagonal elements are set to 0.75. Five pairs of blocks were 
chosen randomly to have a correlation of 0.7 between their elements. This initialization is an 
attempt to mimic a network structure in which there is strong correlation between elements 
that are close to one another in addition to strong correlation between certain groups of 
elements that are not necessarily close. Our motivation for considering this structure is based 
on Xiao (2011), who consider spatial correlations between genes linearly separate on the 



chromosome but spatially close with respect to its three-dimensional structure. 



3. 



Simulated from DNA methylation data. The covariance matrix S u 
to 



is estimated according 



Schafer and Strimmer (2005) based on the features selected from the ERASMUS DNA 



methylation data. 



The clustering idea described in Section 3.3 is not implemented in the simulation studies to 
make the comparisons between different methods easier. For each covariance structure, simulations 
are performed over 25 iterations. The training set has 100 observations per group and the test set 
has 500 observations per group. The dimension p is set to 800. At each iteration, the training 
set and the test set are generated from a multivariate normal distribution with means m and \i2 
and covariance structure E w . The mean of the first group is set to n\ = 0, while the mean of the 
second group is non-zero for the first r = 80 features (with values ranging from 0.2 to 0.6) and 
everywhere else. This configuration reflects the p >> n framework with only 10% relevant features. 



Remark: The choice of [12 is influenced by the Example 11.8.1 from Mardia et al. (1979), where 



it is shown that the correlation is especially influential in classification when there is a difference 
in the components of the mean vector. 

The simulation results are summarized in Table [T] Errors correspond to average percent of 
misclassified observations, features show how many variables on average were selected by each 
algorithm and correct features indicates how many features were selected out of original r = 80. 
Mean values are reported and empirical standard deviations over 25 iterations are shown in brackets. 
The misclassification errors are given as percentages. 

Since FLDAdiag explicitly uses the diagonal estimate of the covariance structure, we expected 
FLDA to perform worse for the case U w = I. However, the performance of the two methods is the 
same. SVM performs much worse than FLDA with this covariance structure. 

FLDA performs slightly better for the blockdiagonal covariance matrix with network structure. 
However, the most significant difference is observed for the covariance structure simulated from the 
real data. In this case both FLDAdiag and SVM have much higher misclassification rates than 
FLDA. 

To make sure that the poor performance of FLDAdiag is not affected by the particular choice 
of 800 features, we independently generated additional 6 covariance structures: 3 from ERASMUS 
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Table 1: Comparison of FLDA, FLDAdiag and SVM. 





Covariance matrix 


FLDA 


FLDAdiag 


SVM 


Errors 


Diagonal 


7.1(1) 


7.1(1) 


17.3(2.1) 




Blockdiagonal 


17.9(1.7) 


19.5(2.3) 


25(2.5) 




oiiiiuiaieu. 


o .^tyyj .u j 






Features 


Diagonal 


110(19) 


127(32) 


94(36) 




Blockdiagonal 


138(47) 


116(36) 


133(29) 




Simulated 


365(52) 


114(55) 


66(12) 


Correct 


Diagonal 


60(4) 


63(4) 


35(9) 


Features 


Blockdiagonal 


63(8) 


61(7) 


30(5) 




Simulated 


69(4) 


61(9) 


23(4) 
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Figure 3: Comparison of FLDA (grey) and FLDAdiag (light) on simulated covariance matrices. 



DNA methylation data and 3 from ERASMUS gene expression data. Each structure was generated 
using a different set of 800 features and the prediction error for FLDA and FLDAdiag was calculated 
based on the 10 iterations. The results are presented in Figure [3j Note that FLDA performs 
consistently better than FLDAdiag with approximately a 3- fold reduction in classification errors. 

Though our method results in good prediction performance, the number of selected features is 
still relatively high in comparison to the number of truly different features. This is not a drawback 
of cross-validation, but is an intrinsic property of the L\ penalized problem we are solving. As 
we discuss in Section [2| very sparse solutions are not always obtainable. Indeed, we performed a 
separate study using one of the simulated covariance matrices to evaluate the number of non-zero 
components selected by the algorithm versus the value of the tuning parameter. The results are 
shown in Figure |4| The drop is observed at around 300 features which is only somewhat less that 
the mean value for selected features in Table [TJ 

This property, however, does not significantly affect the performance of the method. Figure [5] 
illustrates how the misclassification rate varies with the value of the tuning parameter A. 
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p = 800 , g = 2 
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Figure 4: Number of non-zero features vs tuning parameter. 



p = 800 , g = 2 
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Figure 5: Percent of misclassified observations in the test set of size 1000. 
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Number of selected clusters vs tuning parameter, p = 18954 




T 
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Figure 6: Number of non-zero features vs tuning parameter, after clustering the features. 



5 Application to DNA Methylation Data 

We applied our approach to the ERASMUS methylation data, where p = 18954. The ECOG data 
was used as a test set. Since t.8.21 (AML1/ETO) and inv.16 (CBFb/MYHll) biomarkers were 
available for both datasets, they were used to asess the performace of the algorithm. Clustering 



was performed using the k-means algorithm (Hartigan and Wong, 1979). The number of clusters 



was chosen to be k = 200 (approximately 1% of the number of original features) and the number of 
random starts was set to 1000. Note that even with such strong dimension reduction, k is still much 
bigger than n since there are 52 patients in both groups in the ERASMUS data. After transforming 
the features, the discriminant algorithm was applied using different values of the tuning parameter 
A. From Figure [6] we can see that the drop in number of selected features is no longer observed. 
This is consistent with results of Section [2] where we explained why separating the features helps 
to achieve sparser solutions. 

Based on the cross-validation results, the tuning parameter A = 0.24 was chosen, which corre- 
sponds to selecting 5 clusters with centers at 1.8, -2.2, -1.7, 2.4 and 1.5. These 5 selected clusters 
contain only 79 out of 18954 original features. Recall that it is not possible to select fewer than 
300 features if no clustering is performed. The classification error on the ECOG data using these 
5 clusters is 0. In contrast, FLDAdiag applied to the same data chooses the tuning parameter that 
selects 32 clusters (1663 original features). To investigate the effect of the number of clusters, k, 
the same data was analyzed with k = 500. In this case the drop in features is again not observed, 
after cross-validation the algorithm selects the tuning parameter A = 0.133, which corresponds to 
selecting 7 clusters containing 73 original features. The prediction error is again 0. Among these 
73 features 51 were among the ones selected with k = 200. 

Given the positive effect of clustering, we applied the same strategy to perform discriminant 
analysis on four methylation subtypes (inv.16, t.8.21, t.15.17 and double mutants). Since there are 
4 subtypes, 3 discriminant vectors can be considered, obtained sequentially as described in Section 



3.2. To summarize, the following steps are performed: 
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1. Choose the first two subtypes. 

2. Cluster the features based on the differences between the two subtypes. 

3. Perform discriminant analysis on clusters using only two selected subtypes. 

4. Determine which original features correspond to selected clusters and eliminate them from 
the feature set. 

5. Merge two subtypes into one and add one subtype from the remaining ones. 

6. Repeat from Step 2 until the final subtype is added. 

The resulting scores for the ERASMUS data are displayed in Figure [7j The resulting scores 
for the ECOG data plotted on the same coordinates are in Figure [8j The scores are based on 5, 5 
and 7 clusters correspondingly (72, 61 and 104 original features). Symbols "1" and "2" correspond 
to subtypes known in both datasets (t.8.21 and inv.16). The figure indicates strong agreement 
between the ERASMUS and the ECOG datasets in terms of subtype classification. 

DNA methylation data sets from the ERASMUS and ECOG studies are available in GEO, 
http://www.ncbi.nlm.nih.gov/geo/, with access number GSE18700 for ERASMUS. The number 
for ECOG is pending. 



6 Discussion 

In this work we consider an extension of Fisher's discriminant analysis to the case where p » 
n. The core of the method is optimization problem Q which poses several challenges. First, 
the problem is non-convex meaning that convergence to the global maximum is not guaranteed. 
Secondly, the matrices E w and £b have to be reliably estimated. Our use of a non-diagonal shrinkage 



estimator for E w leads to improved misclassification rates over the algorithm proposed by Witten 



and Tibshirani (2011) at the expense of some reduction in computational speed. We also identify 
problems with existing algorithms in selecting a sparse subset of features for classification and 
propose a solution involving pre-clustering of features. Software for applying the method described 
in this paper is under development and will be soon available on CRAN as sparseFLDA. 

The standard FLDA framework involves finding the dominant eigenvector of W^ 1 B. In this 
paper we use a shrinkage estimator W in place of W, but an alternative when p » n is to estimate 



the product Z 1 " 1 ^ directly. In the case of two groups, this approach is considered by Cai (2011) 
who proposes to directly estimate i7~ 1 ( / ui — ^2)- In future research we are planning to extend this 
idea to the general, multiple group, setting. 
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Score 1 Score 1 Score 2 

Figure 7: Discriminant scores for 4 subtypes projected onto 2 dimensions, ERASMUS. 




Figure 8: Discriminant scores for 4 subtypes projected onto 2 dimensions, ERASMUS+ECOG. 
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Appendix 
Proof of Proposition [T] 

Proof. Follows from the fact that the eigenvector I of matrix B has sorted features and if | li \ > 
then \vi\ > \vj\. 

For any vector v G MP, Bv = ^yll T v, so (Bv)i = j(l t v)li and sign((Bv)i) = sign((l T v)li). 
Therefore > \(Bv)j\ iff > Since the algorithm update for solving (11) has the form 

m+ i _ sign((^ m ) J )max(0, \(Bv m )j\ - A/2) 
v j 6 

it follows that if \v™ +1 \ > then > for all i < k since > \l 2 \ > ... > \l P \ > 0. □ 
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Proof of Proposition [2] 

Proof. 1. Suppose F\ < F k . By definition F k = ma^g^u^^^j f(v) > f(v') for all v' £ A k U A 
such that v lT v' < 1. Take v' = M ,i M . Therefore it follows 



i=l i=l 
2. Suppose F k < F k . For all t) 6 U such that u T v < 1 we have 

fc A; / k 

^ \h\Uu\ - 

l\h\ 



f( V ) = 7 (i T v) 2 - = tE^n ( E - ^7 ) ^ 

i=l i=l \t=l 



< 7E (ll /fc H2 " ^j) < max (o>7lK fc || 2 (V Ik - ~ ) ) = F k . 



i=i 



□ 



Proof of Proposition [3] 

Proof. If .Ffc < then vq\ ^ ^ since Fq = > F k and i 7 = max(i ? o, Fx, ...,F p ). From here and 
Proposition [2] it follows that a necessary condition for vq\ G A k is that F k > 0. Indeed, if F k = 



then F k <F k = 0. 

Ffc > by definition is equivalent to 



7lrt-A^>0. 



This can be rewritten as ||/ fc ||2(7lK fc ||2 — jj^r) > 0, which is equivalent to ||Z fc ||2 > 

Since \ \l k \\2 < IK m ||2 for all m > k, this means that if F k > then F m > for all m > k. On 
the other hand, if F m = then ||/ m ||2 < ^jj^-j , which means that F k = for all k < m. From here 

it follows that if F m = then vox ^ U^=i Ai which is equivalent to v$\ £ Aq U U?= m +i Ai- ^ 
Proof of Proposition [4] 

Proof. Note that a sufficient condition for F r > Ffc for r > k is F r > F k . From Proposition [2] this 
is equivalent to 

7lin|^-A&>max(o,7||Z fc ||^-A 



V \h 



Now since if F k = then vo\ ^ Ui=i^j) we oru y need to consider the case F k > 0. So we can 
consider 



7lK ll 2 - A n7fir > 1 1 s — 



which is equivalent to 

' " ; "l2 ~ II' 12 



_^ Il;rll2 Il;fcll2 



- < 

7 



lb \h\ 
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We know that F k > iff ^ < |Zi| | \l k \ \ 2 . It follows that Fr>F k when F k > is equivalent to 

||/r||2 _ ||/A;||2 

M \l% < W l 1Mb 



irii 2 ihi 



This can be rewritten as 



||3/2 



Kllll^lll 

Z fc ||2 for all m > fc, the result follows. □ 
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